Abstract

Suppose a client, Alice, has outsourced her data to an external storage provider, Bob, because he
has capacity for her massive data set, of size $n$, whereas her private storage is much smaller, say, of
size $O(n^{1/r})$, for some constant $r > 1$. Alice trusts Bob to maintain her data, but she would like to
keep its contents private. She can encrypt her data, of course, but she also wishes to keep her access
patterns hidden from Bob as well. We describe schemes for the oblivious RAM simulation problem with
a small logarithmic or polylogarithmic amortized increase in access times, with a very high probability
of success, while keeping the external storage to be of size $O(n)$. To achieve this, our algorithmic
contributions include a parallel MapReduce cuckoo-hashing algorithm and an external-memory
data-oblivious sorting algorithm.

1 Introduction

Suppose Alice owns a large data set, which she outsources to an honest-but-curious server, Bob. Alice trusts
Bob to reliably maintain her data, to update it as requested, and to accurately answer queries on this data.
But she does not trust Bob to keep her information confidential, either because he has a commercial interest
in her data or because she fears security leaks by Bob (or his employees) that could allow third parties to
learn about Alice's data. Such suspicions are well-founded, of course, since both scenarios are quite possible
nowadays with respect to well-known commercial data outsourcers and email services.

For the sake of privacy, Alice can, of course, encrypt the cells of the data she stores with Bob. But
encryption is not enough, as Alice can inadvertently reveal information about her data based on how she
accesses it. For example, Chen et al. [?] show that sensitive information can be inferred from the access
patterns of popular health and financial web sites even if the contents of these communications are encrypted.

Thus, we desire that Alice's access sequence (of memory reads and writes) is data-oblivious, that is, the
probability distribution for Alice's access sequence should depend only on the size, $n$, of the data set and the
number of memory accesses. Formally, we say a computation is data-oblivious if $\Pr(S \mid \mathcal{M})$, the probability
that Bob sees an access sequence, $S$, conditioned on a specific configuration of his memory (Alice's
outsourced memory), $\mathcal{M}$, satisfies $\Pr(S \mid \mathcal{M}) = \Pr(S \mid \mathcal{M}')$, for any memory configuration $\mathcal{M}' \neq \mathcal{M}$
such that $|\mathcal{M}'| = |\mathcal{M}|$. In particular, Alice's access sequence should not depend on the values of any set
of memory cells in the outsourced memory that Bob maintains for Alice. To provide for full application
generality, we assume outsourced data is indexed and we allow Alice to make arbitrary indexed accesses
to this data for queries and updates. That is, let us assume this outsourced data model is as general as the
random-access machine (RAM) model. By accessing her data in a data-oblivious fashion, Alice guarantees
that no inferences about her computations are possible beyond those inferable from time and space bounds,
which themselves can be masked through padding.
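
As a point of reference, the definition above is already met by a trivial strategy: touch every cell on every access. The following Python sketch (the function name `oblivious_read` and the `trace` list are illustrative assumptions, not part of the schemes in this paper) shows that Bob then sees the same address sequence whichever index Alice wants, at the cost of $\Theta(n)$ overhead per access, which is the cost that oblivious RAM simulation aims to avoid.

```python
def oblivious_read(memory, index, trace):
    """Return memory[index] while touching every cell in the same fixed order."""
    result = None
    for address in range(len(memory)):
        trace.append(address)        # the address sequence is all that Bob observes
        cell = memory[address]
        if address == index:         # this comparison happens in Alice's private memory
            result = cell
    return result
```

Two reads of different indices produce identical traces, so the trace reveals only the size of the memory.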



Most computations that Alice would be likely to perform on her outsourced data are not naturally
data-oblivious. We are therefore interested in this paper in simulation schemes that would allow Alice to make
her access sequence data-oblivious with low overhead. For this problem, which is known as oblivious RAM
simulation [2?], we are primarily interested in the case where Alice has a relatively small private memory,
say, of constant size or size that is $O(n^{1/r})$, for a given constant $r > 1$.

1.1 Our Results 

In this paper, we show how Alice can perform an oblivious RAM simulation, with very high probability,¹
with an amortized time overhead of $O(\log n)$ and with $O(n)$ storage overhead purchased from Bob, while
using a private memory of size $O(n^{1/r})$, for a given constant $r > 1$. With a constant-sized memory, we
show that she can do this simulation with overhead $O(\log^2 n)$, with a similarly high probability of success.
At a high level, our result shows that Alice can leverage the privacy of her small memory to achieve privacy
in her much larger outsourced data set of size $n$. For instance, with $r = 4$, Alice could use a private memory
of order one megabyte to achieve privacy in an outsourced memory with size on the order of one yottabyte.
Interestingly, our techniques involve the interplay of some seemingly unrelated new results, which may 
be of independent interest, including an efficient MapReduce parallel algorithm for cuckoo hashing and a 
novel deterministic data-oblivious external-memory sorting algorithm. We combine these results to achieve 
our improved oblivious RAM simulations by exploiting some of the unique performance characteristics of 
cuckoo hashing together with efficient data-oblivious methods for performing cuckoo hashing. 
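
As a quick check of the $r = 4$ example above, assuming decimal units (one megabyte is $10^6$ bytes, one yottabyte is $10^{24}$ bytes), measuring $n$ in bytes, and ignoring the constant hidden in the $O(\cdot)$ notation:

$$n = 10^{24} \quad\Longrightarrow\quad n^{1/r} = \bigl(10^{24}\bigr)^{1/4} = 10^{6}.$$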

1.2 Previous Related Results 

Pippenger and Fischer [38] are the first to study the simulation of one computation in a given model with an
oblivious one in a related model, in that they show how to simulate a one-tape Turing machine computation
of length $n$ with an oblivious two-tape Turing machine computation of length $O(n \log n)$, i.e., with an
$O(\log n)$ overhead.

Goldreich and Ostrovsky [2?] introduce the oblivious RAM simulation problem and show that it requires
an overhead of $\Omega(\log n)$ under some reasonable assumptions about the nature of such simulations. For the
case where Alice has only a constant-size private memory, they show how Alice can easily achieve an
overhead of $O(n^{1/2} \log n)$, with $O(n)$ storage at Bob's server, and, with a more complicated scheme, how
Alice can achieve an overhead of $O(\log^3 n)$ with $O(n \log n)$ storage at Bob's server.

Williams and Sion [47] study the oblivious RAM simulation problem for the case when the data owner,
Alice, has a private memory of size $O(n^{1/2})$, achieving an expected amortized time overhead of $O(\log^2 n)$
using $O(n \log n)$ memory at the data provider. Incidentally, Williams et al. [?3] claim an alternative
method that uses an $O(n^{1/2})$-sized private memory and achieves $O(\log n \log\log n)$ amortized time overhead
with a linear-sized outsourced storage, but some researchers (e.g., see [3?]) have raised concerns with the
assumptions and analysis of this result.

The results of this paper were posted by the authors in preliminary form as [24]. Independently, Pinkas
and Reinman [37] published an oblivious RAM simulation result for the case where Alice maintains a
constant-size private memory, claiming that Alice can achieve an expected amortized overhead of $O(\log^2 n)$
while using $O(n)$ storage space at the data outsourcer, Bob. Unfortunately, their construction contains a
flaw that allows Bob to learn Alice's access sequence, with high probability, in some cases, which our
construction avoids.

Ajtai [3] shows how oblivious RAM simulation can be done with a polylogarithmic factor overhead
without cryptographic assumptions about the existence of random hash functions, as is done in previous
papers [2?, 37, 44, 45], as well as any paper that derives its security or privacy from the random oracle
model (including this paper). A similar result is also given by Damgård et al. [?3]. Although these results
address interesting theoretical limits of what is achievable without random hash functions, we feel that the
assumption about the existence of random hash functions is actually not a major obstacle in practice, given
the ubiquitous use of cryptographic hash functions.

¹ We show that our simulation fails to be oblivious with negligible probability; that is, the probability that the algorithm fails can
be shown to be $O(1/n^{\alpha})$ for any $\alpha > 0$. We say an event holds with very high probability if it fails with negligible probability.

In addition to the previous upper-bound results, Beame and Machmouchi [9] show that if the additional
space utilized at the data outsourcer (besides the space for the data itself) is sufficiently sublinear, then the
overhead for oblivious RAM simulation has a superlogarithmic lower bound. We note that our
logarithmic-overhead result does not contradict this lower bound, however, since we assume Alice can afford to pay for
$O(n)$ additional space at the data outsourcer on top of the space for her $n$ data items (but she would prefer,
say, not to pay for $O(n \log n)$ additional space).

2 Preliminaries 

Before we give the details for our oblivious RAM simulation result, we review some results on cuckoo
hashing, the MapReduce parallel model, and data-oblivious sorting.

2.1 A Review of Cuckoo Hashing 



Pagh and Rodler [36] introduce cuckoo hashing, which is a hashing scheme using two tables, each with $m$
cells, and two hash functions, $h_1$ and $h_2$, one for each table, where we assume $h_1$ and $h_2$ can be modeled as
random hash functions for the sake of analysis. The tables store $n = (1 - \epsilon)m$ keys, where one key can be
held in each cell, for a constant $\epsilon < 1$. Note that the total load factor is less than 1/2. Keys can be inserted
or deleted over time; the requirement is that at most $n = (1 - \epsilon)m$ distinct keys are stored at any time. A
stored key $x$ should be located at either $h_1(x)$ or $h_2(x)$, and, hence, lookups take constant time. On insertion
of a new key $x$, cell $h_1(x)$ is examined. If this cell is empty, $x$ is placed in this cell and the operation is
complete. Otherwise, $x$ is placed in this cell and the key $y$ already in the cell is moved to $h_2(y)$. This may
in turn cause another key to be moved, and so on. We say that a failure occurs if, for an appropriate constant
$c_0$, after $c_0 \log n$ steps this process has not successfully terminated with all keys located in an appropriate
cell. Suppose we insert an $n$th key into the system. Well-known attributes of cuckoo hashing include:

• The expected time to insert a new key is bounded above by a constant.

• The probability a new key causes a failure is $\Theta(1/n^2)$.
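
The insertion rule just reviewed can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name `CuckooTable`, the default constant `c0`, and the use of Python's built-in `hash` on integer tuples as a stand-in for the random hash functions $h_1$ and $h_2$ are all choices made here, and keys are assumed to be distinct integers.

```python
import math


class CuckooTable:
    """Two-table cuckoo hashing with a bounded eviction chain (illustrative)."""

    def __init__(self, m, seed=0):
        self.m = m                                  # cells per table
        self.seed = seed
        self.size = 0                               # number of keys currently stored
        self.tables = [[None] * m, [None] * m]

    def _h(self, side, key):
        # Stand-in for h_1 (side 0) and h_2 (side 1); not a true random oracle.
        return hash((self.seed, side, key)) % self.m

    def lookup(self, key):
        # A stored key can only live at h_1(key) or h_2(key).
        return any(self.tables[s][self._h(s, key)] == key for s in (0, 1))

    def insert(self, key, c0=8):
        """Insert key; return None on success, or the key left homeless on failure."""
        limit = max(1, math.ceil(c0 * math.log2(self.size + 2)))
        current, side = key, 0
        for _ in range(limit):
            pos = self._h(side, current)
            # Place the current key and pick up whatever occupied the cell.
            current, self.tables[side][pos] = self.tables[side][pos], current
            if current is None:
                self.size += 1
                return None
            side = 1 - side                         # the evicted key tries its other table
        return current
```

On failure the sketch returns the key left without a cell instead of rehashing, so the caller decides how to handle it.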

Since its introduction, several variations of cuckoo hashing have been developed and studied. For
instance, one can use asymmetrically sized tables, or use a single table for both hash functions. Using
more than two choices, or $d$-ary cuckoo hashing, allows for higher load factors but requires more complex
insertion mechanisms [?]. Similarly, one can store more than one key in a bucket to obtain higher loads
at the expense of more complex insertion mechanisms [20]. Other variations and uses of cuckoo hashing
are described in [34]. Our work will focus on standard cuckoo hashing, for which we describe a natural
parallelization with provable performance guarantees. Kirsch, Mitzenmacher, and Wieder introduce the idea
of utilizing a stash [30]. A stash can be thought of as an additional memory where keys that would otherwise
cause a failure can be placed. In such a setting, a failure occurs only if the stash itself overflows. For $n$
items inserted in a two-table cuckoo hash table, the total failure probability can be reduced to $O(1/n^{k+1})$
for any constant $k$ using a stash that can hold $k$ keys. We can use this fact to handle polynomial-time
computations using cuckoo hash tables; any fixed polynomial number of inserts and deletes can be handled
with a constant-sized stash, at the expense of having to check the stash at each step. For our results,
we require a generalization of this result to stashes that are larger than constant sized. We present this
generalization in Appendix C (and make note of where it is required).

There has not been substantial previous work specifically on parallel cuckoo hashing, although
historically there has been substantial work on parallel hashing schemes for load balancing (see, e.g., [?, ?6, 29]).
Recently, Alcantara et al. [2] sketch a parallel construction method for a three-table cuckoo hashing scheme,
as well as giving a hybrid parallel-sequential cuckoo hashing method for GPU computing. Although they do
not theoretically analyze their fully-parallel algorithm, it is less efficient in terms of its overall work bounds;
hence, it is less suitable for oblivious simulation than the two-table parallel cuckoo hashing method we
describe here. Later in this paper, we provide a parallel algorithm for cuckoo hashing that runs in $O(\log n)$
time and requires $O(n)$ total work, with very high probability, by applying tail bounds on the performance
of general cuckoo hashing to a novel construction in the MapReduce parallel model.

2.2 The MapReduce Parallel Model 

Historically, the PRAM model [?] and the bulk-synchronous parallel (BSP) model [?] have provided
general and powerful parallel models of computation. In addition, the BSP model has, more recently, been
shown to be a starting point for models for GPU programming [26], cloud computing [39], and multi-core
architectures [41]. Nevertheless, the power and generality of these models make them challenging to use
as parallel programming paradigms.

As an alternative, the MapReduce programming paradigm has been introduced to provide a simple
approach to parallel programming (e.g., see [?4, ?9, 28]). This programming paradigm is gaining widespread
interest, as it is used in Google data centers and has been implemented in the open-source Hadoop
system [43] for server clusters. The MapReduce paradigm has also been recently formalized in parallel
computing models [?9, 28]. It is based on the specification of a computation in terms of a sequence
of map, shuffle, and reduce steps.

2.3 Data-Oblivious Sorting 

Data-oblivious sorting has a long and interesting history, starting with the earliest methods for constructing
sorting networks (e.g., see [31]). Thus, for oblivious RAM sorting, there are several existing algorithms. For
instance, if one is interested in an asymptotically optimal method, then one can use the AKS sorting network,
which is a deterministic data-oblivious sorting method that runs in $O(n \log n)$ time [?]. Unfortunately, this
method is not very practical, so in practice it would probably be more advantageous to use either a simple
deterministic suboptimal method, like odd-even mergesort [8], which runs in $O(n \log^2 n)$ time, or a simple
randomized method, like randomized Shellsort [23], which runs in $O(n \log n)$ time and sorts with very high
probability. Both of these methods are data-oblivious.
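
To make the notion of a data-oblivious sort concrete, here is a Python sketch of Batcher's odd-even mergesort written as a sorting network. The iterative comparator schedule is the standard textbook formulation rather than code from this paper, the function names are illustrative, and the input length is assumed to be a power of two. The sequence of compared positions depends only on $n$, never on the values.

```python
def oddeven_merge_sort_network(n):
    """Yield the compare-exchange pairs of Batcher's odd-even mergesort.

    Assumes n is a power of two; the pairs are a function of n alone.
    """
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare positions that lie in the same merged block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        yield (i + j, i + j + k)
            k //= 2
        p *= 2


def oblivious_sort(items):
    """Sort by applying the fixed network; both cells are rewritten every time."""
    a = list(items)
    for lo, hi in oddeven_merge_sort_network(len(a)):
        x, y = a[lo], a[hi]
        a[lo], a[hi] = (x, y) if x <= y else (y, x)
    return a
```

For $n = 8$ the network has 19 comparators, in line with the $O(n \log^2 n)$ bound quoted above.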

For our main result, however, we desire an efficient method for data-oblivious sorting in the
external-memory model. In this model, memory is divided between an internal memory of size $M$ and an external
memory (like a disk), which initially stores an input of size $N$. The external memory is divided into blocks of
size $B$ and we can read or write any block in an atomic action called an I/O. We say that an external-memory
sorting algorithm is data-oblivious if the sequence of I/Os that it performs is independent of the values of
the data it is processing. Unfortunately, many of the existing efficient external-memory sorting algorithms
are not oblivious in this sense (e.g., see [?2, 42]). Alternatively, existing oblivious external-memory sorting
algorithms [?] are not fully scalable to memories of size $O(n^{1/r})$, or even smaller, which is what we need
here. So we describe an efficient external-memory oblivious sorting algorithm, which is an external-memory
$k$-way modular mergesort that is an external-memory adaptation of Lee and Batcher's generalization [32]
of odd-even mergesort [8]. It uses a data-oblivious sequence of $O((N/B) \log^2_{M/B}(N/B))$ I/Os.



In the sections that follow, we describe how we achieve each of the results outlined above. We begin 
with our parallel algorithm for cuckoo hashing. 

3 MapReduce Cuckoo Hashing 

In this section, we describe our MapReduce algorithm for setting up a cuckoo hashing scheme. We begin 
by reviewing the MapReduce paradigm. 

3.1 The MapReduce Paradigm 

In the MapReduce paradigm (e.g., see [?9, 28]), a parallel computation is defined on a set of values,
$\{x_1, x_2, \ldots, x_n\}$, and consists of a series of map, shuffle, and reduce steps:

• A map step applies a mapping function, $\mu$, to each value, $x_i$, to produce a key-value pair, $(k_i, v_i)$. To
allow for parallel execution, the function, $\mu(x_i) \to (k_i, v_i)$, must depend only on $x_i$.

• A shuffle step takes all the key-value pairs produced in the previous map step, and produces a set of
lists, $L_k = (k; v_{i_1}, v_{i_2}, \ldots)$, where each such list consists of all the values, $v_{i_j}$, such that $k_{i_j} = k$ for a
key $k$ assigned in the map step.

• A reduce step applies a reduction function, $\rho$, to each list, $L_k = (k; v_{i_1}, v_{i_2}, \ldots)$, formed in the shuffle
step, to produce a set of values, $w_1, w_2, \ldots$. The reduction function, $\rho$, is allowed to be defined
sequentially on $L_k$, but should be independent of other lists $L_{k'}$ where $k' \neq k$.
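
The three steps above can be sketched as a single sequential Python function. This is only an illustration of the data flow, not a parallel implementation; the function name `map_shuffle_reduce` and the convention that `rho` receives a key together with its list and returns a list of outputs are assumptions made here.

```python
from collections import defaultdict


def map_shuffle_reduce(values, mu, rho):
    """Run one round: map each value, group the pairs by key, reduce each list."""
    # Map: mu sees one value at a time, so the calls are independent of each other.
    pairs = [mu(x) for x in values]
    # Shuffle: collect all values that share a key k into one list L_k.
    lists = defaultdict(list)
    for key, value in pairs:
        lists[key].append(value)
    # Reduce: rho sees one list at a time and returns a list of output values.
    outputs = []
    for key, group in lists.items():
        outputs.extend(rho(key, group))
    return outputs


# Example: count how many words have each length.
words = ["tree", "stash", "hash", "edge", "graph"]
by_length = map_shuffle_reduce(
    words,
    mu=lambda w: (len(w), w),
    rho=lambda k, group: [(k, len(group))],
)
```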

Since we are using a MapReduce algorithm as a means to an end, rather than as an end in itself, we allow 
values produced at the end of a reduce step to be of two types: final values, which should be included in 
the output of such an algorithm when it completes and are not included in the set of values given as input 
to the next map step, and non-final values, which are to be used as the input to the next map step. Thus, 
for our purposes, a MapReduce computation continues performing map, shuffle, and reduce steps until the 
last reduce step is executed, at which point we output all the final values produced over the course of the 
algorithm. 

In the MRC version of this model [28], the computation of $\rho$ is restricted to use only $O(n^{1-\epsilon})$ working
storage, for a constant $\epsilon > 0$. In the MUD version of this model [?], which we call the
streaming-MapReduce model, the computation of $\rho$ is restricted to be a streaming algorithm that uses only $O(\log^c n)$
working storage, for a constant $c > 0$. Given our interest in applications to data-oblivious computations,
we define a version that further restricts the computation of $\rho$ to be a streaming algorithm that uses only
$O(1)$ working storage. That is, we focus on a streaming-MapReduce model where $c = 0$, which we call
the sparse-streaming-MapReduce model. In applying this paradigm to solve some problem, we assume we
are initially given a set of $n$ values as input, for which we then perform $t$ steps of map-shuffle-reduce, as
specified by a sparse-streaming-MapReduce algorithm, $A$.

The performance complexity of a MapReduce algorithm can be measured in several ways (e.g., see [?9, 28]).
For instance, we can count the number of map-shuffle-reduce steps, $t$. In our case, however, we
are more interested in the performance issues involved in data-oblivious simulation of
sparse-streaming-MapReduce algorithms. Let us define the message complexity of a MapReduce algorithm to be the total size of all the
inputs and outputs to all the map, shuffle, and reduce steps in the algorithm. That is, if we let $n_i$
denote the total size of the input and output sets for the $i$th phase of map, shuffle, and reduce steps, then
the message complexity of a MapReduce algorithm is $\sum_i n_i$. This corresponds to the notion of "work" in a
traditional parallel algorithm [27].



Suppose we have a function, $f(i, n)$, such that $n_i \le f(i, n)$, for each phase $i$, over all possible executions
of a MapReduce algorithm, $A$, that begins with an input of size $n$. In this case, let us say that $f$ is a
ceiling function for $A$. Such a function is useful for bounding the worst-case performance overhead for a
MapReduce computation, especially if it is implemented in a distributed system like Hadoop [43].
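
The two definitions above amount to a sum and a per-phase comparison, as the following Python sketch shows. The function names, the `halving_ceiling` function, and the phase sizes are all made up for illustration; none of them comes from the algorithm analyzed in this paper.

```python
import math


def message_complexity(phase_sizes):
    """Total size of all phase inputs and outputs, i.e. the sum of the n_i."""
    return sum(phase_sizes)


def is_ceiling(f, phase_sizes, n):
    """Check n_i <= f(i, n) for every phase i = 1, 2, ... of one execution."""
    return all(n_i <= f(i, n) for i, n_i in enumerate(phase_sizes, start=1))


def halving_ceiling(i, n):
    # Hypothetical ceiling function: phase sizes shrink by half each round.
    return math.ceil(2 * n / 2 ** (i - 1))


sizes = [16, 7, 3, 1]   # made-up n_i values for an input of size n = 8
```

A ceiling function must hold over all executions, so checking one list of phase sizes, as here, can refute a candidate $f$ but cannot establish it.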

3.2 A MapReduce Algorithm for Cuckoo Hashing 

Let us now describe an efficient algorithm for setting up a cuckoo hashing scheme for a given set,
$X = \{x_1, x_2, \ldots, x_n\}$, of items, which we assume come from a universe that can be linearly ordered in some
arbitrary fashion. Let $T_1$ and $T_2$ be the two tables that we are to use for cuckoo hashing and let $h_1$ and $h_2$
be two candidate hash functions that we are intending to use as well.

Before we give the details of our algorithm, let us give the intuition behind it. For each $x_i$ in $X$, recall
that $h_1(x_i)$ and $h_2(x_i)$ are the two possible locations for $x_i$ in $T_1$ and $T_2$. We can define a bipartite graph, $G$,
commonly called the cuckoo graph, with vertex sets $U = \{h_1(x_i) : x_i \in X\}$ and $W = \{h_2(x_i) : x_i \in X\}$
and edge set $E = \{(h_1(x_i), h_2(x_i)) : x_i \in X\}$. That is, for each edge $(u, v)$ in $E$, there is an associated
value $x_i$ such that $(u, v) = (h_1(x_i), h_2(x_i))$, with parallel edges allowed. Imagine for a moment that an
oracle identifies for us each connected component in $G$ and labels each node $v$ in $G$ with the smallest item
belonging to an edge of $v$'s connected component. Then we could initiate a breadth-first search from the
node $u$ in $U$ such that $h_1(x_i) = u$ and $x_i$ is the smallest item in $u$'s connected component, to define a BFS
tree $T$ rooted at $u$. For each non-root node $v$ in $T$, we can store the item $x_j$ at $v$, where $x_j$ defines the
edge from $v$ to its parent in $T$. (Similar ideas have appeared in the different context of history-independent
cuckoo hashing [35].)

If a connected component $C$ in $G$ is in fact a tree, then this breadth-first scheme will accommodate all
the items associated with edges of $C$. Otherwise, if $C$ contains some non-tree edges with respect to its BFS
tree, then we pick one such edge, $e$. All other non-tree edges belong to items that are going to have to be
stored in the stash. For the one chosen non-tree edge, $e$, we assign $e$'s item to one of $e$'s endvertices, $w$,
and we perform a "cuckoo" action along the path, $\pi$, from $w$ up to the root of its BFS tree, moving each
item on $\pi$ from its current node to the parent of this node on $\pi$. Therefore, we can completely accommodate
all items associated with unicyclic subgraphs or tree subgraphs for their connected components. All other
items are stored in the stash. For cuckoo graphs corresponding to hash tables with load less than 1/2, with
high probability there are no components with two or more non-tree edges, and the stash further increases
the probability that, when such edges exist, they can be handled.
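
The oracle-based placement described above can be sketched sequentially in Python. This is an illustration of the idea only, not the MapReduce algorithm of this section: the function name `place_items`, the vertex encoding `("U", h1(x))` and `("W", h2(x))`, and the assumption that items are distinct, hashable, and comparable are all choices made here.

```python
from collections import defaultdict, deque


def place_items(items, h1, h2):
    """Assign items to cells via BFS trees on the cuckoo graph.

    Returns (slot, stash): slot maps a vertex ("U", h1(x)) or ("W", h2(x)) to
    the item stored there, and stash lists the items that did not fit.
    """
    adj = defaultdict(list)                       # vertex -> [(neighbor, item)]
    for x in items:
        u, w = ("U", h1(x)), ("W", h2(x))
        adj[u].append((w, x))
        adj[w].append((u, x))

    slot, stash, parent, used = {}, [], {}, set()
    for x in sorted(items):                       # smallest item roots its component
        root = ("U", h1(x))
        if root in parent:
            continue                              # component already processed
        parent[root] = None
        queue, non_tree = deque([root]), []
        while queue:
            v = queue.popleft()
            for nbr, item in adj[v]:
                if item in used:
                    continue
                used.add(item)
                if nbr in parent:
                    non_tree.append((v, item))    # this edge closes a cycle
                else:
                    parent[nbr] = v
                    slot[nbr] = item              # store the item at the child end
                    queue.append(nbr)
        if non_tree:
            w, item = non_tree[0]
            # "Cuckoo" along the path from w to the root: each stored item moves
            # to its node's parent, which frees w for the chosen non-tree item.
            path = []
            while w is not None:
                path.append(w)
                w = parent[w]
            for child, par in reversed(list(zip(path, path[1:]))):
                slot[par] = slot[child]
            slot[path[0]] = item
            stash.extend(i for _, i in non_tree[1:])
    return slot, stash
```

A component with exactly one non-tree edge is placed entirely in the tables; each further non-tree edge sends one item to the stash.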

Unfortunately, we don't have an oracle to initiate the above algorithm. Instead, we essentially perform 
the above algorithm in parallel, starting from all nodes in U, assuming they are the root of a BFS tree. 
Whenever we discover a node should belong to a different BFS tree, we simply ignore all the work we did 
previously for this node and continue the computation for the "winning" BFS tree (based on the smallest item 
in that connected component). In Figure 1, we describe an efficient MapReduce algorithm for performing $n$
simultaneous breadth-first searches such that, any time two searches "collide," the search that started from 
a lower-numbered vertex is the one that succeeds. We can easily convert this into an algorithm for cuckoo 
hashing by adding steps that process non-tree edges in a BFS search. For the first such edge we encounter, 
we initiate a reverse cuckoo operation, to allocate items in the induced cycle. For all other non-tree edges, 
we allocate their associated items to the stash. 

Intuitively, the BFS initiated from the minimum-numbered vertex, $v$, in a connected component
propagates out in a wave, bounces at the leaves of this BFS tree, returns back to $v$ to confirm it as the root, and then
propagates back down the BFS tree to finalize all the members of this BFS tree. Thus, in time proportional
to the depth of this BFS tree (which, in turn, is at most the size of this BFS tree), we will finalize all
the members of this BFS tree. And once these vertices are finalized, we no longer need to process them
any more. Moreover, this same argument applies to the modified BFS that performs the cuckoo actions.
Therefore, we process each connected component in the cuckoo graph in a number of iterations that is, in
the worst case, equal to three times the size of each such component (since the waves of the BFS move
down-up-down, passing over each vertex three times).

To bound both the running time of the parallel BFS algorithm and its total work, we require
bounds on the component sizes that arise in the cuckoo graph. Such bounds naturally appear in previous
analyses of cuckoo hashing. In particular, the following result is proven in [30, Lemma 2.4].

Lemma 1: Let $v$ be any fixed vertex in the cuckoo graph and let $C_v$ be its component. Then there exists a
constant $\beta \in (0, 1)$ such that for any integer $k > 0$,

$$\Pr(|C_v| \ge k) \le \beta^k.$$

More detailed results concerning the asymptotics of the distribution of component sizes for cuckoo hash
tables can be found in, for example, [?], although the above result is sufficient to prove linear
message-complexity overhead.

Lemma 1 immediately implies that the MapReduce BFS algorithm (and the extension to cuckoo hashing)
takes $O(\log n)$ time to complete with high probability. Note that a direct application of Lemma 1 gives a
very-high-probability result if we allow $\omega(\log n)$ time to complete; however, we do not need this result
to hold with very high probability for our oblivious simulation.

In addition, we have the following: 

Lemma 2: The message complexity of the MapReduce BFS algorithm is $O(n)$ with very high probability.

Proof: The message complexity is bounded by a constant times $\sum_v |C_v|$, which in expectation is

$$E\Bigl[\sum_v |C_v|\Bigr] = \sum_v E[|C_v|] \le 2m \sum_{k > 0} \Pr(|C_v| \ge k) \le 2m \sum_{k > 0} \beta^k = O(m).$$



To prove a very high probability bound, we use a variant of Azuma's inequality specific to this situation.
If all component sizes were bounded by, say, $O(\log^2 n)$, then a change in any single edge in the cuckoo
graph could affect $\sum_v |C_v|$ by only $O(\log^2 n)$, and we could directly apply Azuma's inequality to the
Doob martingale obtained by exposing the edges of the cuckoo graph one at a time. Unfortunately, all
component sizes are $O(\log^2 n)$ only with very high probability. However, standard results yield that one
can simply add in the probability of a "bad event" to a suitable tail bound, in this case the bad event
being that some component size is larger than $c_1 \log^2 n$ for some suitable constant $c_1$. Specifically, we
directly utilize Theorem 3.7 from [33], which allows us to conclude that if the probability of a bad event is
a superpolynomially small $\delta$, then

$$\Pr\Bigl(\sum_v |C_v| \ge \sum_v E[|C_v|] + \lambda\Bigr) \le e^{-2\lambda^2 / (n c_2 \log^4 n)} + \delta,$$

where $c_2$ is again a suitably chosen constant. Now choosing $\lambda = n^{2/3}$, for example, suffices. ■
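
The quantity $\sum_v |C_v|$ from the proof can also be examined empirically. The Python sketch below draws a random cuckoo graph with a union-find structure and returns that sum. It assumes the graph has all $2m$ table cells as vertices, measures $|C_v|$ in vertices, and uses Python's `random` module in place of random hash functions; the function name `component_size_sum` is illustrative. It shows the quantity being bounded and proves nothing.

```python
import random


def component_size_sum(n, m, seed=0):
    """Sum of |C_v| over the 2m vertices of a random cuckoo graph with n edges."""
    rng = random.Random(seed)
    parent = list(range(2 * m))          # vertices 0..m-1 form U, m..2m-1 form W

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path halving
            v = parent[v]
        return v

    for _ in range(n):
        # One item contributes one edge between a random U cell and a random W cell.
        a, b = find(rng.randrange(m)), find(m + rng.randrange(m))
        if a != b:
            parent[a] = b

    sizes = {}
    for v in range(2 * m):
        r = find(v)
        sizes[r] = sizes.get(r, 0) + 1
    # A component of size s is counted once per vertex, so it contributes s * s.
    return sum(s * s for s in sizes.values())
```

Dividing the result by $2m$ for growing $m$ at a fixed load is one way to look at how the average of $|C_v|$ behaves.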

4 Simulating a MapReduce Algorithm Obliviously 

In this section, we describe efficient oblivious simulations for sparse-streaming-MapReduce computations, 
which imply efficient oblivious constructions for cuckoo hashing. 



4.1 A Reduction to Oblivious Sorting 

Our simulation is based on a reduction to oblivious sorting. 

Theorem 3: Suppose $A$ is a sparse-streaming-MapReduce algorithm that runs in at most $t$
map-shuffle-reduce steps, and suppose further that we have a ceiling function, $f$, for $A$. Then we can simulate $A$ in
a data-oblivious fashion in the RAM model in time $O(\sum_{i=1}^{t} \text{o-sort}(f(i, n)))$, where o-sort($n$) is the time
needed to sort $n$ items in a data-oblivious fashion.

Proof: Let us consider how we can simulate the map, shuffle, and reduce steps in phase $i$ of algorithm $A$
in a data-oblivious way. We assume inductively that we store the input values for phase $i$ in an array, $X_i$.
Let us also assume inductively that $X_i$ can store values that were created as final values in step $i - 1$. A
single scan through the first $f(i, n)$ values of $X_i$, applying the map function, $\mu$, as we go, produces all the
key-value pairs for the map step in phase $i$ (where we output a dummy value for each input value that is
final or is itself a dummy value). We can store each computed value, one by one, in an oblivious fashion
using an output array $Y$. We then obliviously sort $Y$ by keys to bring together all key-value pairs with the
same key as consecutive cells in $Y$ (with dummy values taken to be larger than all real keys). This takes
time $O(\text{o-sort}(f(i, n)))$. Let us then do a scan of the first $f(i, n)$ cells in $Y$ to simulate the reduce step. As
we consider each item $z$ in $Y$, we can keep a constant number of state variables as registers in our RAM,
which collectively maintain the key value, $k$, we are considering, the internal state of registers needed to
compute $\rho$ on $z$, and the output values produced by $\rho$ on $z$. This size bound is due to the fact that $A$ is a
sparse-streaming-MapReduce algorithm. Since the total size of this state is constant, the total number of
output values that could possibly be produced by $\rho$ on an input $z$ can be determined a priori and bounded by
a constant, $d$. So, for each value $z$ in $Y$, we write $d$ values to an output array $Z$, according to the function
$\rho$, padding with dummy values if needed. The total size of $Z$ is therefore $O(d f(i, n))$, which is $O(f(i, n))$.
Still, we cannot simply make $Z$ the input for the next map-shuffle-reduce step at this point, since we need
the input array to have at most $f(i, n)$ values. Otherwise, we would have an input array that is a factor
of $d$ too large for the next phase of the algorithm $A$. So we perform a data-oblivious sorting of $Z$, with
dummy values taken to be larger than all real values, and then we copy the first $f(i, n)$ values of $Z$ to $X_{i+1}$
to serve as the input array for the next step to continue the inductive argument. The total time needed to
perform step $i$ is $O(\text{o-sort}(f(i, n)))$. When we have completed processing of step $t$, we concatenate all the
$X_i$'s together, flagging all the final values as being the output values for the algorithm $A$, which can be done
in a single data-oblivious scan. Therefore, we can simulate each step of $A$ in a data-oblivious fashion and
produce the output from $A$, as well, at the end. Since we do two sorts on arrays of size $O(f(i, n))$ in each
map-shuffle-reduce step, $i$, of $A$, this simulation takes time

$$O\Bigl(\sum_{i=1}^{t} \text{o-sort}(f(i, n))\Bigr).$$
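
One phase of the simulation in the proof can be sketched in Python as follows. The names `simulate_phase`, `rho_step`, and `count_step` are hypothetical, Python's `sorted` stands in for a data-oblivious sort (it is not oblivious itself), final values are omitted for brevity, and one sentinel dummy cell is appended so that the streaming reducer can flush its last group, which is a convenience of this sketch rather than a step taken from the proof.

```python
DUMMY = None


def sort_dummies_last(cells, key=lambda c: c):
    # Stand-in for a data-oblivious sort; sorted() is NOT oblivious and is used
    # here only to keep the sketch short. Dummies compare larger than real cells.
    return sorted(cells, key=lambda c: (c is DUMMY, key(c) if c is not DUMMY else 0))


def simulate_phase(x, size, mu, rho_step, init_state, d):
    """One map-shuffle-reduce phase on an array padded to `size` cells."""
    x = x + [DUMMY] * (size - len(x))
    # Map: one left-to-right scan; dummies map to dummies.
    y = [DUMMY if v is DUMMY else mu(v) for v in x]
    # Shuffle: sort by key so equal keys are adjacent, then add the sentinel.
    y = sort_dummies_last(y, key=lambda kv: kv[0]) + [DUMMY]
    # Reduce: constant-size state, exactly d output cells written per input cell.
    z, state = [], init_state
    for cell in y:
        state, out = rho_step(state, cell)
        assert len(out) <= d
        z.extend(out + [DUMMY] * (d - len(out)))
    # Compact: sort dummies to the end and keep the first `size` cells.
    return sort_dummies_last(z)[:size]


def count_step(state, cell):
    """Streaming reducer with O(1) state that counts the values of each key."""
    key, count = state
    if cell is not DUMMY and cell[0] == key:
        return (key, count + 1), []
    finished = [] if key is None else [(key, count)]
    state = (None, 0) if cell is DUMMY else (cell[0], 1)
    return state, finished


result = simulate_phase(["b", "a", "b", "c", "b"], 8, lambda w: (w, 1), count_step, (None, 0), 1)
```

The number of cells scanned and written in each step depends only on `size` and `d`, which mirrors the role of the ceiling function $f(i, n)$ and the constant $d$ in the proof.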



In Appendix B, we show that by combining this result with Lemma 2 we get the following:

Theorem 4: Given a set of $n$ distinct items and corresponding hash values, there is a data-oblivious
algorithm for constructing a two-table cuckoo hashing scheme of size $O(n)$ with a stash of size $s$, whenever this
stash size is sufficient, in $O(\text{o-sort}(n + s))$ time.

Recall that for the constant-memory case, o-sort($n$) is $O(n \log n)$.



4.2 External-Memory Data-Oblivious Sorting

In this section, we give our efficient external-memory oblivious sorting algorithm. Recall that in this model
memory is divided between an internal memory of size $M$ and an external memory (like a disk), which
initially stores an input of size $N$, and that the external memory is divided into blocks of size $B$, for which
we can read or write any block in an atomic action called an I/O. In this context, we say that an
external-memory sorting algorithm is data-oblivious if the sequence of I/Os that it performs is independent of the
values of the data it is processing. So suppose we are given an unsorted array $A$ of $N$ comparable items
stored in external memory. If $N \le M$, then we copy $A$ into our internal memory, sort it, and copy it back
to disk. Otherwise, we divide $A$ into $k = \lceil (M/B)^{1/3} \rceil$ subarrays of size $N/k$ and recursively sort each
subarray. Thus, the remaining task is to merge these subarrays into a single sorted array.

Let us therefore focus on the task of merging $k$ sorted arrays of size $n = N/k$ each. If $nk \le M$, then we
copy all the lists into internal memory, merge them, and copy them back to disk. Otherwise, let $A[i, j]$ denote
the $j$th element in the $i$th array. We form a set of $m$ new subproblems, where the $p$th subproblem involves
merging the $k$ sorted subarrays defined by $A[i, j]$ elements such that $j \bmod m = p$, for $m = \lceil (M/B)^{1/3} \rceil$.
We form these subproblems by processing each input subarray and filling in the portions of the output
subarrays from the input, sending full blocks to disk when they fill up (which will happen at deterministic
moments independent of data values), all of which uses $O(N/B)$ I/Os. Then we recursively solve all the
subproblems. Let $D[i, j]$ denote the $j$th element in the output of the $i$th subproblem. That is, we can view
$D$ as a two-dimensional array, with each row corresponding to the solution to a recursive merge.

Lemma 5: Each row and column of $D$ is in sorted order and all the elements in column $j$ are less than or
equal to every element in column $j + k$.

Proof: The lemma follows from Theorem 1 of Lee and Batcher [32]. ■

To complete the $k$-way merge, then, we imagine that we slide an $m \times k$ rectangle across $D$, from left
to right. We begin by reading into internal memory the first $2k$ columns of $D$. Next, we sort this set of
elements in internal memory and we output the ordered list of the $km$ smallest elements (holding back a
small buffer of elements if we don't fill up the last block). Then we read in the next $k$ columns of $D$ (possibly
reading in $k$ additional blocks for columns beyond this, depending on the block boundaries), and repeat the
sorting of the items in internal memory and outputting the smallest $km$ elements in order. At any point in
this algorithm, we may need to have up to $2km + (m + 2)B$ elements in internal memory, which, under a
reasonable tall cache assumption (say $M > 3B^4$), will indeed fit in internal memory. We continue in this
way until we process all the elements in $D$. Note that, since we process the items in $D$ from left to right in
a block fashion, for all possible data values, the algorithm is data-oblivious with respect to I/Os.

Consider the correctness of this method. Let $D_1, D_2, \ldots, D_l$ denote the subarrays of $D$ of size $m \times k$
used in our algorithm. By a slight abuse of notation, we have that $D_1 \le D_3$, by Lemma 5. Thus, the smallest
$mk$ items in $D_1 \cup D_2$ are less than or equal to the items in $D_3$. Likewise, these $mk$ items are obviously no larger
than the largest $mk$ items in $D_1 \cup D_2$. Therefore, the first $mk$ items output by our algorithm are the smallest
$mk$ items in $D$. Inductively, then, we can now ignore these smallest $mk$ items and repeat this argument with
the remaining items in $D$.
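The sliding-window merge described above can be illustrated with a small in-memory sketch. This is not the external-memory algorithm itself: it ignores blocks and I/Os, uses `heapq.merge` as a stand-in for the recursive merges, and assumes (for simplicity only) that all inputs have equal length $n$ with $m$ dividing $n$. The function names `build_D` and `sliding_window_merge` are hypothetical.

```python
import heapq

def build_D(arrays, m):
    """Row p of D merges, from every input array, the elements at positions j with j % m == p.
    heapq.merge stands in for the recursive merge of each subproblem."""
    return [list(heapq.merge(*(a[p::m] for a in arrays))) for p in range(m)]

def sliding_window_merge(arrays, m):
    k, n = len(arrays), len(arrays[0])
    # Assumption of this sketch: equal-length inputs and m dividing n,
    # so that D is a full grid with m rows and k*n/m columns.
    assert all(len(a) == n for a in arrays) and n % m == 0
    D = build_D(arrays, m)
    width = k * n // m

    def read_columns(lo, hi):
        # The columns read depend only on k, m, and n, never on the data values.
        return [D[p][j] for j in range(lo, min(hi, width)) for p in range(m)]

    out = []
    buf = read_columns(0, 2 * k)      # first 2k columns
    next_col = 2 * k
    while buf:
        buf.sort()
        if next_col >= width:
            out.extend(buf)           # nothing left to read: flush everything
            buf = []
        else:
            out.extend(buf[:k * m])   # emit the km smallest items
            buf = buf[k * m:] + read_columns(next_col, next_col + k)
            next_col += k
    return out
```

For example, `sliding_window_merge([[1, 4], [2, 3]], 1)` merges two sorted lists into one sorted list; the sequence of column reads would be the same for any other input of the same shape.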

Let us next consider the I/O complexity of this algorithm. The $k$-way merge process is a recursive
procedure of depth $O(\log_{M/B}(N/B))$, such that each level of the recursion is processed using $O(N/B)$
(data-oblivious) I/Os. In addition, there are $O(\log_{M/B}(N/B))$ levels of recursion in the sorting algorithm
itself, for which this $k$-way process performs the merging step in the divide-and-conquer. Thus, we have the
following.



Theorem 6: Given an array $A$ of $N$ comparable items, we can sort $A$ with a data-oblivious external-memory
algorithm that uses $O((N/B) \log^2_{M/B}(N/B))$ I/Os, under a tall-cache assumption ($M > 3B^4$).



Thus, we can use this algorithm and achieve the stated number of I/Os as an instance of the o-sort$(N)$
function. We also have the following.

Theorem 7: Given a set of $n$ distinct items and corresponding hash values, there is a data-oblivious
algorithm for constructing a two-table cuckoo hashing scheme of size $O(n)$ with a stash of size $s = O(\log n)$
whenever this stash size is sufficient, using a private memory of size $O(n^{1/r})$, for a given fixed constant
$r > 1$, in $O(n + s)$ time.

Proof: Combine Theorems 4 and 6 with $N = n + s$, $B = 1$, and $M \in O(n^{1/r})$. When $M$ is $\Theta(n^{1/r})$, we have
$\log_{M/B}(N/B) = O(r) = O(1)$, so o-sort$(n + s)$ is $O(n + s)$. ■



5 Oblivious RAM Simulations 

Our data-oblivious simulation of a non-oblivious algorithm on a RAM follows the general approach of 
Goldreich and Ostrovsky [21], but differs from it in some important ways, most particularly in our use 
of cuckoo hashing. We assume throughout that Alice encrypts the data she outsources to Bob using a 
probabilistic encryption scheme, so that multiple encryptions of the same value are extremely likely to be 
different. Thus, each time she stores an encrypted value, there is no way for Bob to correlate this value to 
other values or previous values. So the only remaining information that needs to be protected is the sequence 
of accesses that Alice makes to her data. 

Our description simultaneously covers two separate cases: the constant-sized private memory case with
very high probability amortized time bounds, and the case of private memory of size $O(n^{1/r})$ for some
constant $r > 1$. The essential description is the same for these settings, with slight differences in how the
hash tables are structured as described below.

We store the $n$ data items in a hierarchy of hash tables, $H_k, H_{k+1}, \ldots, H_L$, where $k$ is an initial starting
point for our hierarchy and $L = \log n$. Each table, $H_i$, has capacity for $2^i$ items, which are distributed
between "real" items, which correspond to memory cells of the RAM, and "dummy" items, which are
added for the sake of obliviousness to make the number of items stored in $H_i$ be the appropriate value. The
starting table, $H_k$, is simply an array that we access exhaustively with each RAM memory read or write.
The lower-level tables, $H_{k+1}$ to $H_l$, for $l$ determined in the analysis, are standard hash tables with $H_i$ having
$2^{i+1}$ buckets of size $O(\log n)$, whereas higher-level tables, $H_{l+1}$ to $H_L$, are cuckoo hash tables, with $H_i$
having $(1 + \epsilon)2^{i+2}$ cells and a stash of size $s$, where $s$ is determined in the analysis and $\epsilon > 0$ is a constant.
The sizes of the hash tables in this hierarchy increase geometrically; hence the total size of all the hash tables
is proportional to the size of $H_L$, which is $O(n)$. Our two settings will differ in the starting points for the
various types of hash tables in the hierarchy as well as the size of the stash associated with the hash tables.

For each $H_i$ with $i < L$, we keep a count, $d_i$, of the number of times that $H_i$ has been accessed since first
being constructed as an "empty" hash table containing $2^i$ dummy values, numbered consecutively from $-1$
to $-2^i$. For convenience, in what follows, let us think of each hash table $H_i$ with $i > l$ as being a standard
cuckoo hash table, with a stash of size $s = O(\log n)$ chosen for the sake of a desired superpolynomially
small error probability. Initially, every $H_i$ is an empty cuckoo hash table, except for $H_L$, which contains all
$n = 2^L$ initial values plus $2^L$ dummy values.

We note that there is, unfortunately, some subtlety in the geometric construction of hash tables in
the setting of constant-sized private memory with very high probability bounds. A problem arises in
that for small hash tables, of size say polylogarithmic in $n$, it is not clear that appropriate very high
probability bounds, which require failures to occur with probability inverse superpolynomial in $n$, hold with
logarithmic-sized stashes. Such results do not follow from previous work [30], which focused on
constant-sized stashes. However, to keep the simulation time small, we cannot have larger than a logarithmic-sized
stash (if we are searching it exhaustively), and we require small initial levels to keep the simulation time
small.

In Appendix C, we show that we can cope with this problem by extending results from [30] to show that
logarithmic-sized stashes are sufficient to obtain the necessary probability bounds for hash tables whose size is
polylogarithmic in $n$. In order to start our hierarchy with small, logarithmic-sized hash tables, we simply
use standard hash tables as described above for levels $k + 1$ to $l = O(\log \log n)$ and use cuckoo hash tables,
each with a stash of size $s = O(\log n)$, for levels $l + 1$ to $L$.

Access Phase. When we wish to make an access for a data cell at index $x$, for a read or write, we first look
in $H_k$ exhaustively to see if it contains an item with key $x$. Then, we initialize a flag, "found," which is
false iff we have not found $x$ yet. We continue by performing an access¹ to each hash table $H_{k+1}$ to $H_L$, which is
either to $x$ or to a dummy value (if we have already found $x$). We give the details of this search in Figure 2.
Our privacy guarantee depends on us never repeating a search. That is, we never perform a lookup for
the same index, $x$ or $d$, in the same table, that we have performed in the past. Thus, after we have performed the above
lookup for $x$, we add $x$ to the table $H_k$, possibly with a new data value if the access action we wanted to
perform was a write.
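As a rough illustration of the access rule, the following sketch lists which key is looked up at each level for one access: the real index $x$ until it is found, and a fresh dummy index afterward. It models only the choice of keys, not hashing, buckets, or stashes. The function name `lookup_keys`, the `found_level` argument, and the convention that the counters start at 1 are assumptions made for this example.

```python
def lookup_keys(x, found_level, k, L, counters):
    """Return the key probed at each level k+1..L for one access to index x.

    found_level is the level that currently holds x (k means the array H_k).
    counters[i] plays the role of d_i and is advanced on every access.
    """
    found = (found_level == k)
    probes = []
    for i in range(k + 1, L + 1):
        d = counters[i]
        counters[i] += 1
        if not found:
            probes.append((i, x))      # real lookup for x at level i
            if i == found_level:
                found = True
        else:
            probes.append((i, -d))     # dummy lookup; -d is never reused at level i
    return probes
```

Every access produces exactly one lookup per level, whether or not $x$ has already been found, which is the property the privacy argument relies on.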

Rebuild Phase. With every table $H_i$, we associate a potential, $p_i$. Initially, every table has zero potential.
When we add an element to $H_k$, we increment its potential. When a table $H_i$ has its potential, $p_i$, reach $2^i$,
we reset $p_i = 0$ and empty the unused values in $H_i$ into $H_{i+1}$ and add $2^i$ to $p_{i+1}$. There are at most $2^i$ such
unused values, so we pad this set with dummy values to make it be of size exactly $2^i$; these dummy values
are included for the sake of obliviousness and can be ignored after they are added to $H_{i+1}$. Of course, this
could cause a cascade of emptyings in some cases, which is fine.

Once we have performed all necessary emptyings, then for any $j < L$, $\sum_{i \le j} p_i$ is equal to the number
of accesses made to $H_j$ since it was last emptied. Thus, we rehash each $H_i$ after it has been accessed $2^i$
times. Moreover, we don't need to explicitly store $p_i$ with its associated hash table, $H_i$, as $d_i$ can be used to
infer the value of $p_i$.

The first time we empty $H_i$ into an empty $H_{i+1}$, there must have been exactly $2^i$ accesses made to $H_{i+1}$
since it was created. Moreover, the first emptying of $H_i$ into $H_{i+1}$ involves the addition of $2^i$ values to $H_{i+1}$,
some of which may be dummy values.

When we empty $H_i$ into $H_{i+1}$ for the second time, it will actually be time to empty $H_{i+1}$ into $H_{i+2}$,
as $H_{i+1}$ would have been accessed $2^{i+1}$ times by this point, so we can simply union the current (possibly
padded) contents of $H_i$ and $H_{i+1}$ together to empty both of them into $H_{i+2}$ (possibly with further cascaded
emptyings). Since the sizes of hash tables in our hierarchy increase geometrically, the size of our final
rehashing problem will always be proportional to the size of the final hash table that we want to construct.
Every $n = 2^L$ accesses we reconstruct the entire hierarchy, placing all the current values into $H_L$. Thus, the
schedule of table emptyings follows a data-oblivious pattern depending only on $n$.
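The potential-based schedule can be simulated directly. The sketch below tracks only the potentials $p_i$ and records which levels are emptied after each access; it does not move any data. The function name `rebuild_schedule` is hypothetical, and emptying level $L$ is used here to denote the full reconstruction of the hierarchy.

```python
def rebuild_schedule(k, L, num_accesses):
    """Return a list of (access number, levels emptied after that access)."""
    p = {i: 0 for i in range(k, L + 1)}    # potentials p_k .. p_L
    events = []
    for t in range(1, num_accesses + 1):
        p[k] += 1                          # each access adds one item to H_k
        emptied = []
        i = k
        while i <= L and p[i] >= 2 ** i:   # cascade of emptyings
            p[i] = 0
            if i < L:
                p[i + 1] += 2 ** i         # 2^i (padded) items move up one level
            emptied.append(i)              # i == L means "rebuild everything"
            i += 1
        if emptied:
            events.append((t, emptied))
    return events
```

With `k = 1` and `L = 3`, level $i$ is emptied exactly at the accesses that are multiples of $2^i$, so the schedule depends only on the access count and not on which cells are accessed.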

Correctness and Analysis. To the adversary, Bob, each lookup in a hash table, $H_i$, is to one random
location (if the table is a standard hash table) or to two random locations and the $s$ elements in the stash (if
the table is a cuckoo hash table), which can be

1. a search for a real item, $x$, that is not in $H_i$,

2. a search for a real item, $x$, that is in $H_i$,

3. a search for a dummy item, $d_i$.



¹This access is actually two accesses if the table is a cuckoo hash table.

Moreover, as we search through the levels from $k$ to $L$ we go through these cases in this order (although
for any access, we might not ever enter case 1 or case 3, depending on when we find $x$). In addition, if we
search for $x$ and don't find it in $H_i$, we will eventually find $x$ in some $H_j$ for $j > i$ and then insert $x$ in
$H_k$; hence, if ever after this point in time we perform a future search for $x$, it will be found prior to $H_i$. In
other words, we will never repeat a search for $x$ in a table $H_i$. Moreover, we continue performing dummy
lookups in tables $H_j$, for $j > i$, even after we have found the item for cell $x$ in $H_i$, which are to random
locations based on a value of $d_j$ that is also not repeated. Thus, the accesses we make to any table $H_i$ are
to locations that are chosen independently at random. In addition, so long as our accesses don't violate
the possibility of our tables being used in a valid cuckoo hashing scheme (which our scheme guarantees
with very high probability, as follows from Appendix C), then all accesses are to independent random
locations that also happen to correspond to locations that are consistent with a valid cuckoo scheme. Finally,
note that we rebuild each hash table $H_i$ after we have made $2^i$ accesses to it. Of course, some of these $2^i$
accesses may have successfully found their search key, while others could have been for dummy values or
for unsuccessful searches. Nevertheless, the collection of $2^i$ distinct keys used to perform accesses to $H_i$
will either form a standard hash table or a cuckoo graph that supports a cuckoo hash table, with a stash of
size $s$, w.v.h.p. Therefore, with very high probability, the adversary will not be able to determine which
among the search keys resulted in values that were found in $H_i$, which were to keys not found in $H_i$, and
which were to dummy values.

Each memory access involves at most $O(s \log n)$ reads and writes, to the tables $H_k$ to $H_L$. In addition,
note that each time an item is moved into $H_k$, either it or a surrogate dummy value may eventually be moved
from $H_k$ all the way to $H_L$, participating in $O(\log n)$ rehashings, with very high probability. In the
constant-memory case, by Theorem 4, each data-oblivious rehashing of $n_i$ items takes $O((n_i + s) \log(n_i + s))$ time.
In addition, in this case, we use a stash of size $s \in O(\log n)$ and set $l = k + O(\log \log n)$. In the case of a
private memory of size $O(n^{1/r})$, each data-oblivious rehashing of $n_i$ items takes $O(n_i)$ time, by Theorem 7.
In addition, in this case, we can use a constant-size stash (i.e., $s = O(1)$), but start with $k = (1/r) \log n$,
so that $H_k$ fits in private memory (with all the other $H_i$'s being in the outsourced memory). The use of a
constant-sized stash, however, limits us to a result that holds with high probability, instead of with very high
probability.

To achieve very high probability in this latter case, we utilize a technique suggested in [22]. Instead
of using a constant-sized stash in each level, we combine them into a single logarithmic-sized stash, used 
for all levels. This allows the stash at any single level to possibly be larger than any fixed constant, giving 
bounds that hold with very high probability. Instead of searching the stash at each level, Alice must load the 
entire stash into private memory and rewrite it on each memory access. Therefore, we have the following. 

Theorem 8: Data-oblivious RAM simulation of a memory of size $n$ can be done in the constant-size
private-memory case with an amortized time overhead of $O(\log^2 n)$, with very high probability. Such a simulation
can be done in the case of a private memory of size $O(n^{1/r})$ with an amortized time overhead of $O(\log n)$,
with very high probability, for a constant $r > 1$. The space needed at the server in all cases is $O(n)$.

6 Conclusion and Future Work 

We have given efficient schemes for performing data-oblivious RAM simulations, which provide significant 
improvements for both the constant-memory and sublinear-memory cases. Our techniques involved the 
interplay of a number of new seemingly-unrelated results for parallel and external-memory algorithms. 

Finally, we return to the issue of using completely random hash functions in our analysis. The existence
of suitable pseudorandom hash functions (or a random oracle) is a standard assumption in this area, although
very recent work by Ajtai [3] and Damgård et al. [13] provides methods for oblivious RAM simulations
without a random oracle, at the cost of a polylogarithmic increase per operation. We expect that similar
results would apply here, although the polylogarithmic overhead required for hashing would clearly
be significant if such computations cannot be done completely in private memory and/or they result in
polylogarithmic server accesses for each hash computation, as in [3], [13]. One promising approach, following
the work of Arbitman, Naor, and Segev [6], would be to use the result of Braverman [10] to show that only
hash functions that are polylogarithmically independent in $n$ are required to obtain the appropriate cuckoo
hashing performance. Such functions can be stored in polylogarithmic space and evaluated obliviously in
polylogarithmic time by reading all of the coefficients. As another possible approach, Aumüller shows that
explicit hash functions, requiring sublinear space, can be used to obtain asymptotically the same bounds on 
the performance of cuckoo hashing with a stash Q. However, it is not yet clear these hash functions can be 
utilized in an oblivious fashion, and may require significant private memory to store the hash functions. As 
this is not the focus of our paper, we leave further details as future work. 

There are a number of interesting open problems, including the following: 

• Can one perform data-oblivious RAM simulation in the constant-memory case with an amortized time
overhead of $O(\log n)$?

• Is there a deterministic external-memory data-oblivious sorting method using $O((N/B) \log_{M/B}(N/B))$
I/Os?
for each vertex v do
    Set v.component = v
    Set v.visitLabel = v
    Set v.parent = v
    Set v.active = true
    Set v.pending = false
    Set v.complete = false
    Set v.finished = false
for 3D iterations, where D is an upper bound on G's diameter do
    for each vertex v do
        if v.active = true then
            Set v.active = false.
            Set v.pending = true.
            for each edge (v, w) do
                if w.visitLabel > v.component then {w's BFS is no longer valid}
                    Let w.active = true.
                    Let d be the minimum component number of a pending neighbor of w.
                    Let w.component = d.
                    if v.component = d then {v "wants" to be the parent of w}
                        if v is the minimum-numbered vertex s.t. v.component = d then
                            Mark (v, w) as a BFS tree edge.
                            Let w.parent = v.
                        else
                            Mark (v, w) as a BFS non-tree edge.
        if v.pending = true and all of v's children are complete then
            Set v.complete = true.
        if v.complete = true and the parent of v is finished then
            Set v.finished = true.
        if v.finished = true and the children of v are finished then
            Mark v as final; the BFS tree is fixed w.r.t. v.

Figure 1: A MapReduce algorithm for multiple overlapping breadth-first searches. The "active" flag is
used to mark vertices that are actively pushing a wave of the BFS to the next level. The "pending" flag is
used to identify vertices waiting for their children to confirm that they all belong to the same BFS tree. The
"complete" flag confirms this fact, and the "finished" flag confirms that the root is complete and the BFS
tree is done. Note that every operation in this algorithm involves testing of simple fields, comparisons of
local fields with a fixed value (which is the same for all members of the same list in a reduce step), or the
computation of minimums or sums. In addition, we assume that the marking of an edge as a BFS edge is
an atomic action, so that parallel copies of that same edge would not be marked. Thus, each step of this
algorithm can be implemented in the sparse-streaming-MapReduce model.
for i = k + 1 to l do
    Read d_i and store its value in a private variable, d.
    Increment d_i's value in external storage.
    if found = false then
        Read the entire bucket H_i[h(x)].
        if this bucket is holding x then
            Set found = true.
            Store x and the contents of data cell x in private storage {we've found it}.
            Revisit H_i[h(x)], marking x as "used" and the others as they were.
        else
            Revisit H_i[h(x)], marking all items as they were.
    else
        Read H_i[h(-d)].
        Revisit H_i[h(-d)], re-marking the items in this bucket as they were.
for i = l + 1 to L do
    Read d_i and store its value in a private variable, d.
    Increment d_i's value in external storage.
    if found = false then
        Read H_i[h_1(x)] and H_i[h_2(x)] and the stash for H_i.
        if one of these locations, y, is holding x then
            Set found = true.
            Store x and the contents of data cell x in private storage {we've found it}.
        Revisit H_i[h_1(x)] and H_i[h_2(x)] and the stash for H_i, marking y (if x was found) as "used" and the others as
        they were.
    else
        Read H_i[h_1(-d)] and H_i[h_2(-d)] and the stash for H_i.
        Revisit H_i[h_1(-d)] and H_i[h_2(-d)] and the stash for H_i, re-marking them as they were.

Figure 2: The oblivious method for accessing our hierarchy of outsourced hash tables.
A Proof of Cuckoo Component Size Distribution

In this appendix, we provide a proof of Lemma 1, which states that there exists a constant $\beta \in (0, 1)$ such
that for any integer $k > 0$,

$$\Pr(|C_v| \ge k) \le \beta^k,$$

where $v$ is any fixed vertex in the cuckoo graph and $C_v$ is its component.

Proof: We use the method of fTSl. Consider a breadth first search starting from some vertex v. The number 
of neighbors of this vertex in the cuckoo graph is bounded by a binomial random variable $\mathrm{Bin}(n, 1/m)$.
Now, assuming a neighbor exists, let $X_2$ be the number of additional edges adjacent to this neighbor; it
is again stochastically bounded by a binomial random variable $\mathrm{Bin}(n, 1/m)$. Continuing a breadth first
search in this manner, if we let $X_1, X_2, \ldots$ be random variables such that $X_i$ represents the number of
new adjacent edges explored at the $i$th vertex of the breadth first search, and let $Y_1, Y_2, \ldots$ be independent
binomial random variables $\mathrm{Bin}(n, 1/m)$, we have

$$\Pr(|C_v| \ge k) \le \Pr\Big(\sum_{i=1}^{k} X_i \ge k\Big) \le \Pr\Big(\sum_{i=1}^{k} Y_i \ge k\Big) = \Pr(\mathrm{Bin}(nk, 1/m) \ge k).$$

Note $nk/m = k/(1 + \epsilon)$. Assuming $\epsilon \le 1$ (the result is easily extended to other cases) we have via a
standard Chernoff bound

$$\Pr(\mathrm{Bin}(nk, 1/m) \ge k) \le e^{-\epsilon^2 k/(3(1+\epsilon))}.$$

Letting $\beta = e^{-\epsilon^2/(3(1+\epsilon))}$ gives the result. ■
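For readers who want the Chernoff step spelled out, here is one way to obtain the exponent, assuming the usual multiplicative form $\Pr(X \ge (1+\delta)\mu) \le e^{-\delta^2 \mu/3}$ for $0 < \delta \le 1$ and writing $X$ for the $\mathrm{Bin}(nk, 1/m)$ variable:

$$\mu = \mathbb{E}[X] = \frac{nk}{m} = \frac{k}{1+\epsilon}, \qquad k = (1+\epsilon)\mu,$$

$$\Pr(X \ge k) = \Pr\big(X \ge (1+\epsilon)\mu\big) \le e^{-\epsilon^2 \mu/3} = e^{-\epsilon^2 k/(3(1+\epsilon))}.$$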



B Proof of Data-Oblivious Cuckoo Hashing Theorem 

In this appendix, we provide a proof of Theorem 4, which states that, given a set of $n$ distinct items, there
is a data-oblivious algorithm for constructing a cuckoo hash table of size $O(n)$ with a stash of size $s$ in
$O(\text{o-sort}(n + s))$ time.

Proof: Combining Theorem 3 and Lemma 2, we get that there is a data-oblivious way to simulate the
MapReduce cuckoo hash construction algorithm in $O(\text{o-sort}(n))$ time. But the output of this algorithm is
not a table of values. It is instead a set, $S$, of pairs, $(i, x)$, where $i$ is an index of a non-empty cell in our hash
table $T$, and $x$ is the value that should be stored in $T[i]$. We use the convention here that elements belonging
to the stash are identified with pairs of the form $(0, x)$. (Note: for simplicity, we also assume here that the
two tables $T_1$ and $T_2$ are concatenated into one table.) To convert $S$ to a standard representation for a cuckoo
hash table, we perform the following (data-oblivious) algorithm.

for each (i, x) pair in S do
    Create a tuple, (i, S, x)
for i = 1 to n do
    Create a tuple, (i, T, 0)
Sort the tuples from S and T (data-obliviously) by first and second coordinates, in a list L.
for i = 1 to s do
    Set H[i] to x if L[i] = (0, S, x), otherwise set H[i] = 0.
Scan L from left to right, copying the value x from an (i, S, x) tuple, with i ≠ 0, to its successor tuple,
    (i, T, 0), making it be (i, T, x).
Sort the set of tuples in L (data-obliviously) by second coordinates.
Truncate this list of tuples to only contain those with a T.
for each tuple (i, T, x) do
    Set T[i] = x

This gives us a standard table format for the cuckoo hashing scheme, together with its stash, H.
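The conversion steps above can be sketched in a few lines. This is only an illustration of the tuple manipulation: Python's built-in `sorted` stands in for the data-oblivious sort, the scan uses an ordinary conditional write rather than an oblivious one, and `None` plays the role of the empty value. The function name `pairs_to_table` is hypothetical, and the sketch assumes each nonzero index in `S` lies in `1..n` and appears at most once.

```python
def pairs_to_table(S, n, s):
    """Turn pairs (i, x), with i == 0 meaning "stash", into a table T[1..n] and a stash H."""
    tuples = [(i, "S", x) for (i, x) in S] + [(i, "T", None) for i in range(1, n + 1)]
    # Stand-in for the oblivious sort: order by index, then tag ("S" sorts before "T").
    lst = sorted(tuples, key=lambda t: (t[0], t[1]))
    # The stash items (index 0) are now at the front of the list.
    H = [lst[j][2] if j < len(lst) and lst[j][:2] == (0, "S") else None for j in range(s)]
    # Copy each value from an (i, S, x) tuple into its successor (i, T, None).
    for j in range(len(lst) - 1):
        i, tag, x = lst[j]
        if tag == "S" and i != 0:
            lst[j + 1] = (i, "T", x)
    # Second sort by tag; it is stable, so the T tuples stay in index order.
    lst = sorted(lst, key=lambda t: t[1])
    T = [None] * (n + 1)               # T[0] is unused so that indices match the text
    for i, tag, x in lst:
        if tag == "T":
            T[i] = x
    return T, H
```

For instance, with `S = [(3, "c"), (0, "z"), (1, "a"), (0, "y")]`, `n = 4`, and `s = 3`, cells 1 and 3 of the table receive `"a"` and `"c"`, and the two stash items end up in `H`.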



C Proof of Bounds on Cuckoo Hashing when the Stash is Larger than Constant Size

We present a sketch of the result demonstrating the value of a stash that is greater than constant sized, which
we require for some of our results. We note that the result here essentially follows using arguments from
[30], but requires some non-trivial additional work, which we provide here.

In what follows let $m$ be the size of each of the two subtables of the cuckoo hash table. Recall that $n$
is the number of keys to be stored by Alice, but the subtables used in our construction can be much smaller
than $n$; indeed, we aim for $m$ to be polylogarithmic in $n$, while maintaining failure probabilities that are
inverse superpolynomial in $n$. This will require stashes of size $O(\log n)$.

To start, we recall some notation from [30]. Let $C_v$ denote the (edges of) the connected component of
a node $v$. Let $\gamma(G)$ be defined to be the smallest number of graph edges that should be removed from
$G$ so that no cycle remains. (That is, $\gamma(G)$ is the cyclomatic number of $G$.) Denote by $T(G)$ the number of
cyclic components in $G$. A simple fact shown in [30] is that the number of items placed in the stash equals
$\gamma(G) - T(G)$. Bounds in [30] are obtained by bounding $\gamma(C_v)$, the excess for a connected component, and
applying stochastic dominance arguments.

Our starting point is [30, Lemma 2.8], which states the following.

Lemma 9 (Lemma 2.8 of [30]): For every vertex $v$ and $t, k, n \ge 1$,

$$\Pr(\gamma(C_v) \ge t \mid |C_v| = k) \le \left(\frac{3e^5 k^3}{m}\right)^t.$$

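Recall also we have shown already in Lemma 1 that, for a constant $\beta$,

$$\Pr(|C_v| \ge k) \le \beta^k.$$

Combining this we see that

$$\Pr(\gamma(C_v) \ge t) \le \sum_{k=1}^{\infty} \Pr(\gamma(C_v) \ge t \mid |C_v| = k) \cdot \Pr(|C_v| \ge k) \le \sum_{k=1}^{\infty} \left(\frac{3e^5 k^3}{m}\right)^t \beta^k.$$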



In [30], it is noted that the right hand side above is $O(m^{-t})$ when $t$ is a constant, but here, where we
may have stashes larger than constant size, we need to consider $t$ of up to $O(\log n)$. However, in the
summation above, we need only concern ourselves with values of $k$ that are $O(\log^2 n)$, as larger $k$ values
yield superpolynomially small terms in the summation, in which case as long as $m$ is $\Omega(\log^7 n)$, then we
have $\Pr(\gamma(C_v) \ge t)$ is $m^{-\Omega(t)}$. Specifically, we claim that $\Pr(\gamma(C_v) \ge j + 1)$ is at most $m^{-1-\alpha j}$ for some
constant $\alpha$. Note that this derivation requires $m$ to be polylogarithmic in $n$.
Now, following the derivation at the end of Theorem 2.2 of [30], we have that the probability that the stash
exceeds a total size $s$, where $s$ should be taken as $O(\log n)$, is given by

$$\begin{aligned}
\Pr(\gamma(G) - T(G) \ge s) &\le \sum_{k=1}^{2m} \binom{2m}{k} s^k m^{-\alpha s - k} \\
&\le \sum_{k=1}^{2m} \left(\frac{2em}{k}\right)^k s^k m^{-\alpha s - k} \\
&= m^{-\alpha s} \sum_{k=1}^{2m} \left(\frac{2es}{k}\right)^k \\
&= m^{-\Omega(s)}.
\end{aligned}$$

The last line follows since the summation is polynomial in $n$, and $m^{-\Omega(s)}$ is superpolynomially small in $n$,
given our restrictions on $m$ and $s$.

(We note there are some changes from the end of Theorem 2.2 of [30]; specifically, there is a typographical
error, where in the original theorem they have $k^s$ in place of $s^k$. The term is a bound for the number of
ways $k$ positive numbers can sum to $s$, which is easily bounded by $s^k$; the error does not affect the results
of [30] but the difference is important here.)

It would be interesting to obtain better bounds on stashes of greater than constant size. In this case, 
it could simplify our construction, if we could instead use cuckoo hash tables from the initial point of our 
construction with constant private memory. 

We emphasize that this result that keys can be placed in a cuckoo hash table with a suitably sized stash 
with very high probability is used in two places in our argument. First, it must apply to the data items actually 
stored in the hash table; if the items cannot be stored, the RAM simulation fails. Second, it must apply to 
the sequence of locations examined during the actual search process. That is, in running our simulation, 
we sometimes search for real items and sometimes for dummy items in the hash tables. To Bob, searching
for dummy items must not be distinguishable from searching for real items. In particular, we must have it 
be the case that in a given cuckoo hash table, the real and dummy items searched for correspond to a valid 
placement in the cuckoo hash table. As our argument above does not distinguish between real and dummy 
items, we can simply union over the two cases (real items, and the mix of real and dummy items) to maintain 
our high probability bound. 

D A Flaw in the Construction of Pinkas and Reinman 

As mentioned above, Pinkas and Reinman [37] published an oblivious RAM simulation result for the
case where Alice maintains a constant-size private memory, claiming that Alice can achieve an expected
amortized overhead of $O(\log^2 n)$ while using $O(n)$ storage space at the data outsourcer, Bob. Unfortunately,
their construction contains a flaw that allows Bob to learn Alice's access sequence, with high probability, in
some cases, which our construction avoids.

Like many of the recent oblivious RAM simulation results, their construction involves the use of a 
hierarchy of $O(\log n)$ hash tables, which in their case are cuckoo hash tables (without stashes) that start out
(at their highest level) being of a constant size and double in size with every level. 

Consider now a stage in the simulation when most of the levels are full and the client does a lookup for
a set of items, $x_1, \ldots, x_k$, which are still on the bottom level, having never been accessed before. In this
construction, Alice will do a cuckoo lookup for each of the $x_j$'s in all the levels in this case, finding them
only at the bottom. But at the higher (smaller) levels, these $x_j$'s are not in the cuckoo tables. Moreover,
these $x_j$'s were not considered (even as "dummy elements") when constructing these cuckoo tables. So it is
quite likely, especially for the really small cuckoo tables, that the sequence of lookups done for $x_1, \ldots, x_k$
will turn out to produce a set of lookups that cannot possibly be for a set of items that are all contained in
the cuckoo table. That is, they are likely to form a connected component in the cuckoo graph with more
than one cycle (with at least constant probability for the really small tables). Thus, the adversary will learn
in this case that Alice was searching for at least one item that was not found in that small table. Put another
way, with reasonably high probability (in terms of $k$), this access sequence would be distinguishable from
one, say, where Alice kept looking up the same item, $x$, over and over again, for this latter sequence will
always succeed in finding dummy elements in the smaller tables (hence, will always produce a set of valid
cuckoo-table lookups).

Incidentally, in a private communication, we have communicated this concern to Pinkas and Reinman, 
and they have indicated that they plan to repair this flaw in the journal version of their paper. 